Articles published in the early years of the Journal of the American Statistical 
Association, i.e., 1888-1910s, posited new theories mostly by using arithmeti- 
cal arguments. Starting around the mid-1910s the arguments became algebraic 
in nature and by the 1920s this trend was well established. Today, a century 
later, in addition to cogent mathematical arguments, new statistical develop- 
ments are becoming computational, such is the power and influence of the 
modern computer (a device un-dreamed of in those earlier days). Likewise, 
we see enormous changes in the size and nature of assembled data sets for 
our study. Therefore, entirely new paradigms are entering our discipline, rad- 
ically changing the way we go about our art. This chapter focuses on one 
such method wherein the data are symbolically valued, i.e., hypercubes in 
p-dimensional space $\mathbb{R}^p$, instead of the classically valued points in $\mathbb{R}^p$. 



29.1 Introduction 

The advent of modern computing power means that entirely new paradigms
are entering our discipline, radically changing the way we go about our
art. One hundred years ago, researchers were transitioning from arithmetical
arguments to algebraic arguments (i.e., mathematical tools such as algebra,
calculus, and the like) when developing their new mathematically based
ideas. Today's transition lies more along the lines of computational
mathematical/statistical developments as we struggle with the massive
data sets at hand. This chapter focuses on one such method, symbolic data,
projected by Goodman (2011) as one of the two most important new
developments on the horizon, wherein the data are symbolically valued, i.e.,
hypercubes in p-dimensional space $\mathbb{R}^p$, instead of points in $\mathbb{R}^p$ as for classical
data. In Section 29.2, we describe briefly what symbolic data are and how 
they might arise. Then, in Section 29.3, we illustrate some symbolic method- 
ological analyses and compare the results with those obtained when using 
classical surrogates. Some concluding remarks about the future of such data 
are presented in Section 29.4. 



29.2 Symbolic data 

Symbolic data consist of lists, intervals, histograms and the like, and arise 
in two broadly defined ways. One avenue is when data sets of classical point 
observations are aggregated into smaller data sets. For example, consider a 
large medical data set of millions of individual observations with informa- 
tion such as demographic (e.g., age, gender, etc.), geographical (e.g., town 
of residence, country, region, ...), basic medical diagnostics (pulse rate, blood 
pressure, weight, height, previous maladies and when, etc.), current ailments 
(e.g., cancer type such as liver, bone, etc.; heart condition, etc.), and so on. 
It is unlikely the medical insurer (or medical researcher, or...) is interested in 
the details of your specific visit to a care provider on a particular occasion; 
indeed, the insurer may not even be interested in your aggregated visits over a 
given period of time. Rather, interest may focus on all individuals (and their 
accumulated history) who have a particular condition (such as heart valve fail- 
ure), or, maybe interest centers on the collection of individuals of a particular 
gender-age group with that condition. Thus, values are aggregated across all 
individuals in the specific categories of interest. It is extremely unlikely that 
all such individuals will have the same pulse rate, the same weight, and so 
forth. Instead, the aggregated values can take values across an interval, as a 
histogram, as a list of possible values, etc. That is, the data set now consists 
of so-called symbolic data. 

Automobile insurers may be interested in accident rates of categories such 
as 26-year-old male drivers of red convertibles, and so on. Census data are 
frequently in the form of symbolic data; e.g., housing characteristics for regions 
may be described as {owner occupied, .60; renter occupied, .35; vacant, .05} 
where 60% of the homes are owner occupied, etc. 
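
The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the category labels, and the function name `aggregate` are hypothetical and do not come from any data set discussed in this chapter. Each category's pulse rates are collapsed into an interval, and its ailments into a list of values with relative frequencies.

```python
from collections import Counter, defaultdict

# Hypothetical classical records: (gender-age group, pulse rate, ailment).
records = [
    ("F 40-49", 62, "heart"),
    ("F 40-49", 71, "liver"),
    ("F 40-49", 66, "heart"),
    ("M 40-49", 74, "heart"),
    ("M 40-49", 80, "bone"),
]


def aggregate(rows):
    """Collapse classical rows into one symbolic observation per category."""
    groups = defaultdict(list)
    for category, pulse, ailment in rows:
        groups[category].append((pulse, ailment))

    symbolic = {}
    for category, values in groups.items():
        pulses = [p for p, _ in values]
        counts = Counter(a for _, a in values)
        n = len(values)
        symbolic[category] = {
            # Interval-valued variable: [min, max] of the aggregated values.
            "pulse": (min(pulses), max(pulses)),
            # List-valued variable with relative frequencies.
            "ailment": {name: c / n for name, c in counts.items()},
        }
    return symbolic


symbolic = aggregate(records)
for category, observation in symbolic.items():
    print(category, observation)
```

Other aggregations (histograms, for instance) follow the same pattern; which one is appropriate depends on the scientific question.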

There are countless examples. The prevailing thread is that large data sets 
of single classical observations are aggregated in some way with the result 
that symbolic data perforce emerge. There are a myriad of ways these original 
data sets can be aggregated, with the actual form being driven by the scientific 
question/s of interest. 

On the other hand, some data are naturally symbolic in nature. For ex- 
ample, species are typically described by symbolic values; e.g., the mushroom 
species bernardi has a pileus cap width of [6, 7] cm. However, the particular 
mushroom in your hand may have a cap width of 6.2 cm, say. Pulse rates 
bounce around, so that an apparent rate of 64 (say) may really be 64 ± 2, i.e., 
the interval [62,66]. There are numerous examples. 

Detailed descriptions and examples can be found in Bock and Diday (2000)
and Billard and Diday (2006). A recent review of current methodologies is
available in Noirhomme-Fraiture and Brito (2011), with a non-technical
introduction in Billard (2011). The original concept of symbolic data was
introduced in Diday (1987). Note that symbolic data are not the same as fuzzy
data. They are also generally different from the coarsening and grouping
concepts of, e.g., Heitjan and Rubin (1991), although there are some similarities. 

The major issue then is how do we analyse these intervals (or, histo- 
grams, ...)? Taking classical surrogates, such as the sample mean of aggre- 
gated values for each category and variable, results in a loss of information. 
For example, the intervals [10, 20] and [14, 16] both have the same midpoint; 
yet they are clearly differently valued observations. Therefore, it is important 
that analytical methods be developed to analyse symbolic data directly so as 
to capture these differences. There are other underlying issues that pertain 
such as the need to develop associated logical dependency rules to maintain 
the integrity of the overall data structure; we will not consider this aspect 
herein however. 
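
The loss of information can be made concrete with a small Python sketch. It assumes, purely for illustration, that values are uniformly spread within each interval, so that the internal variance of $[a, b]$ is $(b - a)^2/12$; the function names are hypothetical.

```python
def midpoint(a, b):
    """Classical surrogate: the centre of the interval [a, b]."""
    return (a + b) / 2


def internal_variance(a, b):
    """Variance of a uniform distribution on [a, b] (an assumption)."""
    return (b - a) ** 2 / 12


wide, narrow = (10, 20), (14, 16)

# Both intervals share the same midpoint ...
print(midpoint(*wide), midpoint(*narrow))
# ... but their internal variations are very different.
print(internal_variance(*wide), internal_variance(*narrow))
```

A method that sees only the midpoints cannot distinguish the two observations; one that also uses the internal variation can.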



29.3 Illustrations 

Example 29.1 Table 29.1 displays (in two parts) a data set of histogram-valued
observations, extracted from Falduti and Taibaly (2004), obtained by
aggregating by airline approximately 50,000 classical observations for flights
arriving at and departing from a major airport. For illustrative simplicity,
we take three random variables $Y_1$ = flight time, $Y_2$ = arrival-delay-time,
and $Y_3$ = departure-delay-time for 10 airlines only. Thus, for airline
$u = 1, \ldots, 10$ and variable $j = 1, 2, 3$, we denote the histogram-valued
observation by $Y_{uj} = \{[a_{ujk}, b_{ujk}), p_{ujk} : k = 1, \ldots, s_{uj}\}$, where the histogram
subinterval $[a_{ujk}, b_{ujk})$ has relative frequency $p_{ujk}$ with $\sum_k p_{ujk} = 1$. The number of
subintervals $s_{uj}$ can vary across observations ($u$) and across variables ($j$). 

Figure 29.1 shows the tree that results when clustering the data by a 
Ward's method agglomerative hierarchy algorithm applied to these histogram 
data when the Euclidean extended Ichino-Yaguchi distance measure is used; 
see Ichino and Yaguchi (1994) and Kim and Billard (2011, 2013), for details. 
Since there are too many classical observations to be able to build an equiva- 
lent tree on the original observations themselves, we resort to using classical 
surrogates. In particular, we calculate the sample means for each variable 
and airline. The resulting Ward's method agglomerative tree using Euclidean 
distances between the means is shown in Figure 29.2. 
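
As a rough sketch of what a histogram-valued observation carries beyond its mean, the Python below computes a mean and an internal variance for the flight-time histogram of airline 5 in Table 29.1. It assumes values are uniformly distributed within each subinterval; that assumption, and the function names, are illustrative and are not taken from the cited papers.

```python
# Flight-time histogram for airline 5 (nonzero subintervals of Table 29.1):
# each entry is (a_k, b_k, p_k) for the subinterval [a_k, b_k).
airline_5_flight_time = [
    (0, 70, 0.17867),
    (70, 110, 0.41499),
    (110, 150, 0.40634),
]


def histogram_mean(hist):
    """Mean under a uniform-within-subinterval assumption."""
    return sum(p * (a + b) / 2 for a, b, p in hist)


def histogram_variance(hist):
    """Internal variance under the same assumption.

    For a uniform variable on [a, b], E[Y^2] = (a^2 + a*b + b^2) / 3.
    """
    second_moment = sum(p * (a * a + a * b + b * b) / 3 for a, b, p in hist)
    return second_moment - histogram_mean(hist) ** 2


mean_5 = histogram_mean(airline_5_flight_time)
var_5 = histogram_variance(airline_5_flight_time)
print(mean_5, var_5)
```

The classical surrogate keeps only the first of these two numbers; a histogram-based distance can also use the second.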
TABLE 29.1

Airline data.

Y1    [0,70)     [70,110)   [110,150)  [150,190)  [190,230)  [230,270)
1     .00017     .10568     .33511     .20430     .12823     .045267
2     .13464     .10799     .01823     .37728     .35063     .01122
3     .70026     .22415     .07264     .00229     .00065     .00000
4     .26064     .21519     .34916     .06427     .02798     .01848
5     .17867     .41499     .40634     .00000     .00000     .00000
6     .28907     .41882     .28452     .00683     .00076     .00000
7     .00000     .00000     .00000     .00000     .03811     .30793
8     .39219     .31956     .19201     .09442     .00182     .00000
9     .00000     .61672     .36585     .00348     .00174     .00000
10    .76391     .20936     .01719     .00645     .00263     .00048

Y2    [-40,-20)  [-20,0)    [0,20)     [20,40)    [40,60)    [60,80)
1     .09260     .38520     .28589     .09725     .04854     .03046
2     .09537     .45863     .30014     .07433     .03226     .01683
3     .12958     .41361     .21008     .09097     .04450     .02716
4     .06054     .44362     .33475     .08648     .03510     .01865
5     .08934     .44957     .29683     .07493     .01729     .03746
6     .07967     .36646     .28376     .10698     .06070     .03794
7     .14024     .30030     .29573     .18293     .03659     .01067
8     .03949     .40899     .33727     .12483     .04585     .02224
9     .07840     .44599     .21603     .10627     .04530     .03310
10    .10551     .55693     .22989     .06493     .02363     .01074

Y3    [-15,5)    [5,25)     [25,45)    [45,65)    [65,85)    [85,105)
1     .67762     .16988     .05714     .03219     .01893     .01463
2     ...        ...        ...        ...        ...        ...
3     .65249     .14071     .06872     .04025     .02749     .01669
4     .77650     .14516     .04036     .01611     .01051     .00526
5     .63112     .24784     .04323     .02017     .02882     .00288
6     .70030     .12064     .06297     .04628     .02049     .01290
7     .73323     .16463     .04726     .01677     .01220     .00305
8     .78711     .12165     .05311     .01816     .00772     .00635
9     .71080     .12369     .05749     .03310     .01916     .00523
10    .83600     .10862     .03032     .01408     .00573     .00286


TABLE 29.1

Airline data (continued).

Y1    [270,310)  [310,350)  [350,390)  [390,430)  [430,470)  [470,540]
1     .07831     .07556     .02685     .00034     .00000     .00000
2     .00000     .00000     .00000     .00000     .00000     .00000
3     .00000     .00000     .00000     .00000     .00000     .00000
4     .03425     .02272     .00729     .00000     .00000     .00000
5     .00000     .00000     .00000     .00000     .00000     .00000
6     .00000     .00000     .00000     .00000     .00000     .00000
7     .34299     .21494     .08384     .01220     .00000     .00000
8     .00000     .00000     .00000     .00000     .00000     .00000
9     .00523     .00174     .00348     .00000     .00000     .00174
10    .00000     .00000     .00000     .00000     .00000     .00000

Y2    [80,100)   [100,120)  [120,140)  [140,160)  [160,200)  [200,240]
1     .01773     .01411     .00637     .00654     .01532     .00000
2     .01403     .00281     .00281     .00000     .00281     .00000
3     .02094     .01440     .01276     .00884     .02716     .00000
4     .00797     .00661     .00356     .00051     .00220     .00000
5     .00865     .00576     .00576     .00576     .00865     .00000
6     .02883     .00835     .01366     .00835     .00531     .00000
7     .00762     .00305     .00152     .00762     .01372     .00000
8     .00817     .00635     .00227     .00136     .00318     .00000
9     .01916     .01394     .00871     .01220     .02091     .00000
10    .00286     .00143     .00167     .00095     .00143     .00000

Y3    [105,125)  [125,145)  [145,165)  [165,185)  [185,225)  [225,265]
1     .00878     .00000     .00361     .00947     .00775     .00000
2     .00281     .00000     .00000     .00140     .00140     .00000
3     .01407     .00000     .01014     .01407     .01538     .00000
4     .00305     .00000     .00085     .00068     .00153     .00000
5     .00865     .00000     .00865     .00000     .00865     .00000
6     .01897     .00000     .00986     .00607     .00152     .00000
7     .00457     .00000     .00152     .00762     .00915     .00000
8     .00227     .00000     .00136     .00045     .00182     .00000
9     .01045     .00000     .01742     .01394     .00871     .00000
10    .00095     .00000     .00072     .00048     .00024     .00000


It is immediately apparent that the trees differ, even though both have
the same "determinant": agglomerative, Ward's method, and Euclidean
distances. However, one tree is based on the means only while the other is
based on the histograms; i.e., the histogram tree of Figure 29.1, in addition to
the information in the means, also uses information in the internal variances of
the observed values. Although the details are omitted, it is easy to show that,
e.g., airlines (1, 2, 4) have similar means and similar variances overall; however,
by omitting the information contained in the variances (as in Figure 29.2),
while airlines (1, 2) have comparable means, they differ from those for airline 4.
That is, the classical surrogate analysis is based on the means only. 

A polythetic divisive tree built on the Euclidean extended Ichino-Yaguchi
distances for the histograms is shown in Figure 29.3; see Kim and Billard
(2011) for this algorithm. The corresponding monothetic divisive tree is
comparable. This tree is different again from those of Figures 29.1 and 29.2; these
differences reflect the fact that different clustering algorithms, along with
different distance matrices and different methods, can construct quite
different trees. 
FIGURE 29.1 

Ward's method agglomerative tree based on histograms. 



FIGURE 29.2

Ward's method agglomerative tree based on means.



FIGURE 29.3

Polythetic divisive tree based on histograms.



Example 29.2 Figure 29.4 displays simulated individual classical observations
$(Y_1, Y_2)$ drawn from bivariate normal distributions $N_2(\mu, \Sigma)$. There
are five samples each with n = 100 observations. Sample $s = 1$ has mean
$\mu = (5, 0)$, standard deviations $\sigma_1 = \sigma_2 = .25$ and correlation coefficient
$\rho = 0$; samples $s = 2, 3$ have $\mu = (1, 1)$, $\sigma_1 = \sigma_2 = .25$ and $\rho = \ldots$; and samples
$s = 4, 5$ have $\mu = (1, 1)$, $\sigma_1 = \sigma_2 = 1$ and $\rho = .8$. Each of the samples can be
aggregated to produce a bivariate histogram observation $Y_s$, $s = 1, \ldots, 5$.



FIGURE 29.4

Simulated data: how many clusters?



When a divisive algorithm for histogram data is applied to these data, three
clusters emerge containing the observations $C_1 = \{Y_1\}$, $C_2 = \{Y_2, Y_3\}$, and
$C_3 = \{Y_4, Y_5\}$, respectively. In contrast, applying algorithms, e.g., a K-means
method, to the classical observations (or to classical surrogates such as the
means) identifies only two clusters, viz., $C_1 = \{Y_1\}$ and $C_2 = \{Y_2, Y_3, Y_4, Y_5\}$.
That is, since information such as the internal variations is not part of a
classical analysis, the classical clustering analyses are unable to identify
observations $Y_2$ and $Y_3$ as being different from observations $Y_4$ and $Y_5$. 
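
The point can be illustrated with a small simulation in Python. This is a sketch only, not the simulation behind Figure 29.4: the seed and function names are arbitrary, and the correlation for the low-variance sample is set to 0 as an assumption, since it only affects the joint shape and not the marginal summaries compared here.

```python
import math
import random
import statistics


def simulate(mu, sigma, rho, n, rng):
    """Draw n points from a bivariate normal with equal standard deviations."""
    points = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        y1 = mu[0] + sigma * z1
        y2 = mu[1] + sigma * (rho * z1 + math.sqrt(1 - rho**2) * z2)
        points.append((y1, y2))
    return points


def summarize(points):
    """Mean and standard deviation of the first coordinate."""
    y1 = [p[0] for p in points]
    return statistics.mean(y1), statistics.stdev(y1)


rng = random.Random(0)  # arbitrary seed, for reproducibility only
low_spread = simulate((1, 1), 0.25, 0.0, 100, rng)  # rho = 0 is assumed
high_spread = simulate((1, 1), 1.0, 0.8, 100, rng)

mean_low, sd_low = summarize(low_spread)
mean_high, sd_high = summarize(high_spread)
print(mean_low, mean_high)  # close to each other
print(sd_low, sd_high)      # clearly different
```

The two samples are nearly indistinguishable by their means, so a method that sees only the means has no basis for separating them; the difference lives in the internal variation.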

Example 29.3 Consider the data of Table 29.2 where, for simplicity, we
restrict attention to minimum and maximum monthly temperatures for four
months only, January, April, July, and October, $Y_1$ to $Y_4$, respectively, in 1988
for n = 10 weather stations. The interval values for station $u = 1, \ldots, 10$
and variable $j = 1, \ldots, 4$ are denoted by $Y_{uj} = [a_{uj}, b_{uj}]$. Elevation is also
included as $Y_5$; note $Y_5$ is a classical value and so is a special case of an
interval value with $Y_{u5} = [a_{u5}, a_{u5}]$. The data are extracted from
http://dss.ucar.edu/datasets/ds578.5, which contains annual monthly weather
values for several variables for many stations in China over many years. 



TABLE 29.2

Temperature intervals and elevation.

Station  January          April           July            October         Elevation
u        [a_u1, b_u1]     [a_u2, b_u2]    [a_u3, b_u3]    [a_u4, b_u4]    a_u5
1        [-18.4, -7.5]    [-0.1, 13.2]    [17.0, 26.5]    [0.6, 13.1]      4.82
2        [-20.0, -9.6]    [0.2, 11.9]     [17.8, 27.2]    [-0.2, 12.5]     3.44
3        [-23.4, -15.5]   [-4.5, 9.5]     [12.9, 23.0]    [-4.0, 8.9]     14.78
4        [-27.9, -16.0]   [-1.5, 12.0]    [16.1, 25.0]    [-2.6, 10.9]     4.84
5        [-8.4, 9.0]      [1.7, 16.4]     [10.8, 23.2]    [1.4, 18.7]     73.16
6        [2.3, 16.9]      [9.9, 24.3]     [17.4, 22.8]    [14.5, 23.5]    32.96
7        [2.8, 16.6]      [10.4, 23.4]    [16.9, 24.4]    [12.4, 19.7]    37.82
8        [10.0, 17.7]     [15.8, 23.9]    [24.2, 33.8]    [19.2, 27.6]     2.38
9        [11.5, 17.7]     [17.8, 24.2]    [25.8, 33.5]    [20.3, 26.9]     1.44
10       [11.8, 19.2]     [16.4, 22.7]    [25.6, 32.6]    [20.4, 27.3]     0.02


A principal component analysis on these interval-valued data produces 
the projections onto the $PC_1 \times PC_2$ space shown in Figure 29.5. Details 
of the methodology and the visualization construction can be found in Le- 
Rademacher and Billard (2012), and further details of this particular data set 
are in Billard and Le-Rademacher (2012). 

FIGURE 29.5

PCA based on intervals.



Notice in particular that since the original observations are hypercubes
in $\mathbb{R}^p$ space, the corresponding principal components are
hypercubes in PC-space. The relative sizes of these PC hypercubes reflect the
relative sizes of the data hypercubes. For example, if we compare the observed
values of stations u = 5 and u = 10 in Table 29.2, it is clear that the
intervals across the variables for u = 5 are on balance wider than are those for
u = 10; thus, the principal component hypercube is larger for u = 5 than for
u = 10. That is, the observation u = 5 has a larger internal variation. These
internal variations are a component of the covariance terms in the covariance
(and correlation) matrix. This feature is not possible in a classical analysis,
with the point observation in $\mathbb{R}^p$ being transformed into but a point value in
PC-space, as shown in Figure 29.6 for the classical principal component
analysis performed on the interval means. While both the symbolic and classical
analyses showed the temperatures as being of comparable importance to $PC_1$
with elevation being important only for $PC_2$, the visualizations through the
PC hypercubes of Figure 29.5 are more informative than are the PC points of
Figure 29.6. 
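
The comparison of stations 5 and 10 can be checked directly from the temperature intervals of Table 29.2. The short Python sketch below uses the average interval width as a simple summary of hypercube size; that summary and the function name are illustrative choices, not part of the cited PCA methodology.

```python
# Temperature intervals (January, April, July, October) from Table 29.2.
station_5 = [(-8.4, 9.0), (1.7, 16.4), (10.8, 23.2), (1.4, 18.7)]
station_10 = [(11.8, 19.2), (16.4, 22.7), (25.6, 32.6), (20.4, 27.3)]


def mean_width(intervals):
    """Average width b - a over a list of intervals [a, b]."""
    return sum(b - a for a, b in intervals) / len(intervals)


width_5 = mean_width(station_5)
width_10 = mean_width(station_10)
print(width_5, width_10)
```

Station 5's intervals are on average more than twice as wide as those of station 10, in line with its larger hypercube in PC-space.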



FIGURE 29.6

PCA based on means.



29.4 Conclusion

By the time that Eddy (1986) considered the future of computers in statistical
research, it was already clear that a computer revolution was raising its head
over the horizon. This revolution was not simply focused on bigger and better
computers to do traditional calculations on a larger scale, though that too was
a component, then and now. Rather, more expansively, entirely new ways of
approaching our art were to be the new currency of the looming 21st century.
Early signs included the emergence of new methodologies such as the
bootstrap (Efron, 1979) and the Gibbs sampler (Geman and Geman, 1984), though
both owed their roots to earlier researchers. While clearly these and similar
computational methodologies had not been feasible in earlier days, thereby
being a product of computer advances, they are still classical approaches per se.
By the 1990s, COPSS Presidential Addresses referred to the upcoming
information and technological revolution, its waves already heading for the unwary;
see, e.g., Kettenring (1997) and Billard (1995, 1997). 

However, the real advances will take quite different formats to those pre- 
dicted in the 1990s. In a very prescient comment, Schweizer (1984) declared 
that "distributions are the numbers of the future." The present is that future. 
Furthermore, today's future consists of a new paradigm whereby new method- 
ologies, and new theories to support those methodologies, must be developed 
if we are to remain viable players as data analysts. These new methods must 
also be such that the classical models of the still-present and past come out 
correctly as special cases of the new approaches. 

In this chapter, one such approach, viz., symbolic data, has been described 
albeit ever so briefly. While a study of the literature may at first suggest there 
are many symbolic techniques currently available, in reality there are very few 
and even then those few handle relatively narrowly defined situations. 

There are two major directions for future work: one is to develop the new 
methodologies for new data structures and to extend the plethora of situations 
that a century or more of research in so-called classical statistics produced, 
while the other is to establish mathematical underpinning to support those 
new methods (somewhat akin to the theoretical foundations provided initially 
by Bickel and Freedman (1981), and Singh (1981), which validated the early 
bootstrap work). One thing is certain: the present-future demands that we 
engage our energies in addressing the myriad of issues surrounding large and 
complex data sets. It is an exciting time to be a part of this undertaking. 